International Journal of Engineering and Technical Research (IJETR) 
ISSN: 2321-0869 (O) 2454-4698 (P), Volume-4, Issue-1, January 2016 


Statistical analysis for the characterization of the 
wastewater in the influent of a treatment plant 

(Case study)

Facundo Cortés-Martínez, Alejandro Treviño-Cansino, Agustín Sáenz-López, Rajeswari Narayanasamy


Abstract— A problem facing wastewater treatment systems is to 
identify industrial wastewater discharges and to classify the
pollutant concentration received at the treatment plant as
weak, medium or strong. A statistical analysis of the pollutant
concentrations in the influent can therefore guide the
operators of these systems in their control decisions.
A statistical analysis was carried out for a treatment plant,
using influent concentration data taken from an external
source. The following parameters were analyzed:
biochemical oxygen demand and chemical oxygen demand.
According to the results, the wastewater was classified as
medium strength, and some industrial wastewater discharges
could be identified. The criteria for identifying possible
infringing users are included.

Index Terms — Statistical analysis, discharge control, sewage, 
industrial discharges, biochemical oxygen demand. 

I. Introduction 

The characterization of wastewater in the influent of a
municipal treatment plant provides essential information,
such as the concentration trend of one or more specific
pollutants. According to the Environmental Protection Agency
(EPA) of the United States of America, regulating discharges
from commercial, industrial, and service users into sewer
systems is a major challenge for the authorities in charge,
because the sources are scattered and controlling them
requires considerable work (EPA, 2002; 2003; Wills et al.,
2010). Therefore, the agency recommends identifying,
tracking, and controlling non-domestic wastewater discharges
in order to prevent possible interference with the biological
treatment at the municipal treatment plant (EPA, 1987; 1991;
UNAM, 2000). A consequence of a high pollutant content, such
as biochemical oxygen demand (BOD), is that the treated
wastewater does not meet the quality standard for discharge
into receiving bodies, NOM-001-ECOL-1996 (DOF, 1997). This
represents an economic problem for the plant, the town
council, and the water operator corporation.

Discharging pollutants above the limits indicated by the
standard can cause several problems in the municipal pipes,
among them: a temperature increase above 40 degrees in the
wastewater moving through the pipe system, pipes blocked by
fats and solids, and the risk of explosions in the sewer
system (DOF, 1998; EPA, 1987; CNA e IMTA, 2000; 2007).

Two parameters are analyzed in this work: biochemical oxygen
demand (BOD) and chemical oxygen demand (COD). BOD is the
amount of dissolved oxygen that microorganisms require to
oxidize organic matter. The test lasts 5 days, which is why
the parameter is written as BOD5 (Crites and Tchobanoglous,
2000; CNA e IMTA, 2000). COD measures the organic matter in
wastewater indirectly, since potassium dichromate is used for
the oxidation. The values of this parameter are generally
higher than those of BOD5, because the COD test oxidizes any
kind of oxidizable substance, while the BOD5 test oxidizes
only those that can be biologically degraded (Metcalf & Eddy,
1991; CNA e IMTA, 2000).

The main aim of this article was to apply descriptive
(non-parametric) statistics to the BOD5 and COD concentration
data in the influent of the treatment plant, with the purpose
of classifying the strength of the wastewater and identifying
possible discharges from industrial processes.

The contribution of this document is the analysis criteria
and the interpretation of the results of the statistical
analysis applied to the characterization of wastewater at the
treatment plant.

II. Methodology 

A. Classification of wastewater. 

According to Metcalf & Eddy (1991), the typical
concentration of wastewater falls into three classes:
weak, medium, and strong. Table 1 shows an excerpt of the
parameters and concentrations.


Table 1. Composition of raw domestic wastewater. Metcalf & Eddy (1991).

                                                          Concentration
Pollutant                                         Units   Weak   Medium   Strong
Biochemical oxygen demand, 5 days, 20 °C (BOD5)   mg/L    110    220      400
Chemical oxygen demand (COD)                      mg/L    250    500      1000
Settleable solids                                 mL/L    5      10       20

Next, the basic nomenclature of descriptive statistics is
described. The mathematical expressions and nomenclature
were taken from Perez (2002) and Guarin (n.d.).

B. Nomenclature

Arithmetic mean or average
Values of the variable X
Number of observations (n)
Summation sign (Σ)
Frequency
Median (Me)
Lower limit of the interval that contains the median
Cumulative frequency preceding the median interval
Frequency of the median interval
Lower limit of the interval that contains the quartile
Quartile (Q)
Range (R)
Maximum value of the variable (x_max)
Minimum value of the variable (x_min)
Fisher asymmetry coefficient (g1)
Fisher kurtosis coefficient (g2)
Lower limit of the modal interval
Frequency of the modal class
Frequency of the premodal class
Cumulative frequency preceding the interval that contains the quartile
Absolute cumulative frequency
Cumulative relative frequency
Relative frequency
Absolute frequency
Third-order moment about the mean (m3)
Cubed standard deviation
Fourth-order moment about the mean

C. Frequency distributions. 

In statistics, frequency is the number of times a value of
the variable is repeated, also called absolute frequency
($n_i$). Divided by the total number of observations $n$, it
is called relative frequency ($f_i$):

$$ f_i = \frac{n_i}{n} \tag{1} $$

The absolute cumulative frequency gives the number of cases
located at or below a certain value:

$$ N_i = \sum_{j=1}^{i} n_j \tag{2} $$

The relative cumulative frequency is the absolute cumulative
frequency divided by the total number of values of the
variable under study:

$$ F_i = \frac{N_i}{n} \tag{3} $$
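The absolute, relative and cumulative frequencies described above can be tabulated with a few lines of Python. This is only a minimal sketch: the function name `frequency_table` and the sample values are hypothetical and are not taken from the study data.

```python
from collections import Counter


def frequency_table(values):
    """Build a frequency table for a list of discrete values.

    Each row holds the value x, its absolute frequency n_i, relative
    frequency f_i, absolute cumulative frequency N_i and relative
    cumulative frequency F_i.
    """
    n = len(values)
    counts = Counter(values)
    rows = []
    cumulative = 0
    for x in sorted(counts):
        n_i = counts[x]
        cumulative += n_i  # running sum of absolute frequencies
        rows.append({
            "x": x,
            "n_i": n_i,
            "f_i": n_i / n,
            "N_i": cumulative,
            "F_i": cumulative / n,
        })
    return rows


# Hypothetical sample, for illustration only.
for row in frequency_table([110, 110, 220, 400, 400, 400]):
    print(row)
```

The last row always has N_i equal to the number of observations and F_i equal to 1.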

Table 2 shows the way the frequency distribution is usually 
presented. 


Table 2. Table of frequencies.
Source: Perez (2002).

A histogram represents the frequencies of the variable in
intervals, where the area of each rectangle is proportional to
the frequency of the interval in question. According to Perez
(2002), a frequency histogram is a representation of the data
that shows three important properties of the distribution:
shape, central tendency and dispersion.

D. Measures of central tendency or position

The main measures of central tendency are the arithmetic
mean or average, the median and the mode.

The arithmetic mean is the sum of all data values divided by
the number of values. Some of its properties are: all values
of the variable are involved, the value is unique, and it is
interpreted as a balance point. Equation (4) gives the
expression for grouped variables, where $n_i$ is the absolute
frequency of the value $x_i$ and $m$ is the number of distinct
values or classes.

$$ \bar{x} = \frac{x_1 n_1 + x_2 n_2 + \cdots + x_i n_i + \cdots + x_m n_m}{n} = \frac{1}{n} \sum_{i=1}^{m} x_i n_i \tag{4} $$

The median is the value of the variable below which half of
the data lie, with the other half above; this measure is used
in nonparametric statistics. Some of its properties are: it is
less sensitive to outliers than the average, and dispersion
does not affect its value, so it is considered more
representative than the arithmetic mean. When the information
is grouped in equal intervals, the median is calculated using
the following expression, where $L_i$ is the lower limit of
the interval that contains the median, $N_{i-1}$ the
cumulative frequency preceding that interval, $n_i$ the
frequency of the median interval and $A$ the interval width:

$$ Me = L_i + \frac{\frac{n}{2} - N_{i-1}}{n_i} \, A \tag{5} $$

Trimmed mean: a more robust measure, as it is less sensitive
to outliers. It discards a number of observations at both the
upper and lower ends of the variable under study.

When data are grouped into intervals of equal size, the mode
is calculated using the following expression, where $L_i$ is
the lower limit of the modal interval, $f_m$, $f_{m-1}$ and
$f_{m+1}$ are the frequencies of the modal, premodal and
postmodal classes, and $A$ is the interval width:

$$ Mo = L_i + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \, A \tag{6} $$

Quantiles: percentiles, quartiles and deciles. They divide the
ordered dataset into groups with the same number of elements.
In the case of quartiles, the variable is divided into four
groups with the same number of data.

E. Measures of dispersion

They indicate how concentrated the data are around the
measures of central tendency. They comprise the variance,
standard deviation, coefficient of variation and range. In the
present paper only the range is used, as this study is only
an exploratory analysis of the original data without
transforming it (non-parametric statistics). The range is the
difference between the highest and lowest values of the
distribution.

$$ R = x_{\max} - x_{\min} \tag{8} $$

The interquartile range is the difference between the third
and first quartiles; this range contains 50 percent of the
data.

$$ \text{IQ Range} = Q_3 - Q_1 \tag{9} $$

The box plots show the median, the interquartile range and
the outliers of the variable under study. The lower edge of
the box corresponds to the first quartile (25 percent), while
the upper edge corresponds to the third quartile (75 percent).
To identify outliers, limits are set both below and above the
box. They are based on the breadth of the box, that is, the
interquartile range: the first limit is placed at 1.5 times
the IQ Range from the edge of the box, and the second limit at
3 times that range.

F. Skewness and kurtosis

The Fisher coefficient is a measure of asymmetry, which
analyzes how the data are distributed around the average. If
the Fisher coefficient $g_1 = 0$, the distribution is
symmetrical; if $g_1 < 0$, it is negatively asymmetric (to the
left); and if $g_1 > 0$, it is positively asymmetric (to the
right). To obtain the asymmetry coefficient it is first
necessary to calculate a statistic known as the moment of
order three with respect to the mean ($m_3$); the coefficient
is then determined by expression (10), where $s^3$ is the
cubed standard deviation.

$$ g_1 = \frac{m_3}{s^3} \tag{10} $$

Kurtosis evaluates how the frequencies are distributed in the
central region compared with the normal curve. A distribution
is mesokurtic (equal to the normal curve) when the kurtosis
$g_2 = 0$, leptokurtic (more peaked than the normal curve)
when $g_2 > 0$, and platykurtic (flatter than the normal
curve) when $g_2 < 0$. The kurtosis is determined using
equation (11), where $m_4$ is the fourth-order moment about
the mean.

$$ g_2 = \frac{m_4}{s^4} - 3 \tag{11} $$
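The measures described in this section (mean, median, trimmed mean, range, interquartile range, outlier limits, and the Fisher asymmetry and kurtosis coefficients) can be sketched for raw, ungrouped data as follows. This is a minimal illustration and not the procedure used in the study: the function name `describe`, the sample values and the quartile method are assumptions, and statistical packages may apply small-sample corrections, so their results can differ slightly.

```python
import statistics


def describe(values, trim=0.05):
    """Descriptive measures for raw (ungrouped) data."""
    data = sorted(values)
    n = len(data)
    mean = statistics.fmean(data)
    # Central moments about the mean, each divided by n.
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    s = m2 ** 0.5
    # Quartiles; the default "exclusive" method is an assumption.
    q1, _, q3 = statistics.quantiles(data, n=4)
    iq_range = q3 - q1
    # Trimmed mean: drop the same number of values from each end.
    k = int(n * trim)
    trimmed = data[k:n - k]
    return {
        "mean": mean,
        "median": statistics.median(data),
        "trimmed_mean": statistics.fmean(trimmed),
        "range": data[-1] - data[0],
        "iq_range": iq_range,
        "limits": (q1 - 1.5 * iq_range, q3 + 1.5 * iq_range),
        "g1": m3 / s ** 3,      # Fisher asymmetry coefficient
        "g2": m4 / s ** 4 - 3,  # Fisher kurtosis coefficient
    }


# Hypothetical concentrations in mg/L, for illustration only.
sample = [31, 95, 120, 146, 160, 183, 190, 223, 260, 310, 438]
for name, value in describe(sample).items():
    print(name, value)
```

A positive `g1` indicates a longer tail toward high values, and values outside `limits` would be drawn as outliers in a box plot.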


The wastewater measurement data were taken from an external
source, Lichman (2013); the same data were also used in the
following studies: Belanche et al. (1992), Garcia (1993) and
Bejar et al. (1993). Only measurements from the influent of a
treatment system were used: 1046 measurements. Because of its
volume, the information is not included in this document;
however, it can be consulted in the cited source.

III. RESULTS AND DISCUSSION 

A. Biochemical oxygen demand (BOD) 

Figure 1 presents the frequency histogram.

Figure 1. Histogram of BOD5 in the influent of the treatment
plant.

The class marks represent the concentration of organic
matter, in intervals of 19 mg/L.

Shape measures: the asymmetry of the distribution was
positive: 0.78, versus zero for the normal curve. According to
Figure 1, the left side drops off steeply, while the right
side descends gently. The tail of the distribution is longer
toward values above the average. The kurtosis, 1.36, was less
than 3, so the distribution was classified as platykurtic.
The five classes at the peak are the concentrations of organic
matter that appear most frequently.

Position measures: the statistics calculated with the SPSS
program were as follows. With a confidence level of 95
percent, the mean is estimated to lie between 183 (lower
limit) and 194 (upper limit). The average is 189, the 5
percent trimmed mean is 186, the median is 183 and the mode
is 133. For a positively asymmetric distribution the average
is greater than the median, as is the case here. The trimmed
mean is closer to the median than the arithmetic mean is.

Dispersion measures: Figure 2 shows the dispersion of the
BOD5 data, with a minimum value of 31 mg/L and a maximum of
438 mg/L. The range is very wide: 407 mg/L.


Figure 2. Dispersion of BOD5 in the influent of the
treatment plant.

According to Figure 2, some measurements stand out at first
glance as low or high values. In order to identify outliers
and extreme values, Figure 3 presents the box-and-whisker plot
for BOD5.

Figure 3. Box diagram of BOD5 in the influent of the
treatment plant.


Quartile results: Q1 = 146 corresponds to the bottom of the
box, Q2 = 183 is the median, and Q3 = 223 corresponds to the
top of the box. The interquartile range is 77; the lower
whisker limit is 30.5, rounded to 31, and the upper limit is
338.5, rounded to 339. According to Quevedo and Perez (2008),
the interquartile range represents the dispersion of the
central 50 percent of the data. The upper outliers seen in
Figure 3 lie more than 1.5 times the interquartile range above
the third quartile. These outliers suggest a positive skewness
of the distribution. The median being displaced from the
center of the box, as shown in Figure 3, also indicates
positive skewness. The lowest and highest values that resulted
from the analysis are shown in Table 3.
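The whisker limits quoted above follow directly from the quartiles. A minimal sketch, assuming the usual rule of 1.5 times the interquartile range beyond each edge of the box; the function name `whisker_limits` is hypothetical:

```python
def whisker_limits(q1, q3, factor=1.5):
    """Return the (lower, upper) outlier limits from the quartiles."""
    iq_range = q3 - q1
    return q1 - factor * iq_range, q3 + factor * iq_range


# BOD5 quartiles reported in the text (mg/L).
lower, upper = whisker_limits(146, 223)
print(lower, upper)  # 30.5 338.5
```

With `factor=3` the same function would give the second, more extreme limit mentioned for box plots.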


Table 3. Extreme values for BOD5

                     Case number   Value
BOD5   Higher   1    196           438.00
                2    113           431.00
                3    367           427.00
                4    103           404.00
                5    294           380.00
       Lower    1    207            31.00
                2    203            48.00
                3    212            58.00
                4    201            64.00
                5    202            66.00


All the highest values given in Table 3 were above the upper
limit determined in the quartile analysis, 339, so they are
considered outliers. The lowest values are below the weak
concentration shown in Table 1, that is, less than 110 mg/L.
Only case 207 coincided with the lower limit, although the
box-and-whisker plot indicates that it is outside the limit;
this is because the limit was 30.5 but was rounded up to 31.
Such low concentrations are not a problem for a biological
treatment system. The strong concentration values can be seen
in the same Table 1.

The statistical literature mentions that it is wise to study
outliers before removing them. If they are not eliminated,
the conclusions may be wrong or the results distorted.
Because atypical measurements were observed in the present
study, the relationship between BOD5 and COD was then
examined.

According to Table 1, the medium concentration of organic
matter for wastewater is 220 mg/L. The average BOD5 was 189
mg/L, which lies between the weak and medium values, namely
110 and 220. Because the average is closer to 220 than to
110, the wastewater is classified as medium concentration for
organic matter, although Figure 2 shows values ranging from
below the weak concentration to above the strong
concentration.
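The reasoning in the previous paragraph, choosing the class whose typical value is closest to the observed average, can be written as a small function. This is a sketch of that rule only; the name `nearest_class` and the dictionary layout are hypothetical, and the typical values are those of Table 1.

```python
# Typical concentrations from Table 1 (mg/L), Metcalf & Eddy (1991).
TYPICAL = {
    "BOD5": {"weak": 110, "medium": 220, "strong": 400},
    "COD": {"weak": 250, "medium": 500, "strong": 1000},
}


def nearest_class(parameter, value):
    """Return the Table 1 class whose typical value is closest to value."""
    levels = TYPICAL[parameter]
    return min(levels, key=lambda name: abs(levels[name] - value))


# Average BOD5 reported in the text: 189 mg/L.
print(nearest_class("BOD5", 189))  # medium
```

For 189 mg/L the distances are 79 to the weak value and 31 to the medium value, so the medium class is selected.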

B. Chemical Oxygen Demand 
Figure 4 shows the histogram.

Figure 4. Histogram for COD in the influent of the
treatment plant.


Shape measures: the asymmetry of the distribution was
positive: 0.58. According to Figure 4, the tail of the
distribution is longer toward values above the average. Four
peaks can be observed: 355 was the most frequent class,
followed by 472, 433 and 394. The distribution was platykurtic.

Position measures: the statistics obtained were the
following. With a confidence level of 95 percent, the mean is
estimated to lie between 396 (lower limit) and 417 (upper
limit). The average was 407 and the 5 percent trimmed mean,
which reduces the effect of outliers, was 403. The median was
398 and the mode 380. As with BOD5, the trimmed mean is closer
to the median than the arithmetic mean is.

In order to estimate the number of measurements in each class
of Table 1, the cumulative frequency percentage is displayed
in Figure 5.

Figure 5. Cumulative frequency graph, in percentage, of
the COD in the influent of the treatment plant.

According to Figure 5, approximately 12.38 percent of the
measurements (65) are below the weak concentration, while
82.40 percent (431) are below the medium concentration. All of
the values were lower than the strong concentration shown in
Table 1.

Another observation is that the difference between the
cumulative percentages at the weak and medium concentrations
was 70.02 percent: 366 values. This confirms that most
measurements are of the medium concentration type (Table 1).
The remaining 17.6 percent, 92 values, were between the medium
and strong concentrations.
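The counts quoted above can be reproduced from the cumulative percentages. The sketch below assumes 523 COD measurements, which is half of the 1046 measurements reported for the two parameters; that per-parameter figure is an assumption, as are the names `N_MEASUREMENTS` and `count_from_percent`.

```python
N_MEASUREMENTS = 523  # assumption: half of the 1046 reported measurements


def count_from_percent(percent, n=N_MEASUREMENTS):
    """Convert a cumulative percentage read from the graph into a count."""
    return round(percent / 100 * n)


below_weak = count_from_percent(12.38)    # below the weak concentration
below_medium = count_from_percent(82.40)  # below the medium concentration
between = below_medium - below_weak       # weak to medium
above_medium = N_MEASUREMENTS - below_medium

print(below_weak, below_medium, between, above_medium)  # 65 431 366 92
```

Under that assumption the three groups (65, 366 and 92) add up to the 523 measurements.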

Dispersion measures: Figure 6 shows the dispersion of the COD
data, with a minimum value of 81 mg/L and a maximum value of
941 mg/L. As with BOD5, the range is very wide: 860 mg/L.

Figure 6. Dispersion of COD in the treatment plant
influent.

In order to identify atypical measurements and outliers, the 
box and whisker plot is presented in Figure 7. 



Figure 7. Box and whisker diagram for COD in the treatment 
plant influent. 

Quartile results were: Q1 = 325, Q2 = 398 and Q3 = 478.

The interquartile range is 153, the lower whisker limit is 96
and the upper whisker limit is 708. Figure 7 shows that only
case 207 was below the lower limit. The lowest and highest
values that resulted from the analysis are shown in Table 4.


Table 4. Outlier values for COD

                    Case number   Value
COD   Higher   1    192           941.00
               2    344           887.00
               3    420           841.00
               4    367           815.00
               5    328           812.00
      Lower    1    207            81.00
               2    426           105.00
               3    204           126.00
               4    140           133.00
               5    202           152.00
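The values of Table 4 can be checked against the whisker limits obtained from the COD quartiles (95.5 and 707.5 before rounding). This is a minimal sketch; the names `TABLE_4` and `split_outliers` are hypothetical, and the data are copied from the table above.

```python
# COD quartiles reported in the text (mg/L).
Q1, Q3 = 325, 478
IQ_RANGE = Q3 - Q1
LOWER_LIMIT = Q1 - 1.5 * IQ_RANGE  # 95.5, reported as 96
UPPER_LIMIT = Q3 + 1.5 * IQ_RANGE  # 707.5, reported as 708

# Case number -> COD value (mg/L), copied from Table 4.
TABLE_4 = {
    192: 941.0, 344: 887.0, 420: 841.0, 367: 815.0, 328: 812.0,
    207: 81.0, 426: 105.0, 204: 126.0, 140: 133.0, 202: 152.0,
}


def split_outliers(values, lower, upper):
    """Return the cases below the lower limit and above the upper limit."""
    below = {case: v for case, v in values.items() if v < lower}
    above = {case: v for case, v in values.items() if v > upper}
    return below, above


below, above = split_outliers(TABLE_4, LOWER_LIMIT, UPPER_LIMIT)
print(sorted(below))  # [207]
print(sorted(above))  # [192, 328, 344, 367, 420]
```

Only case 207 falls below the lower limit, while the five highest values all exceed the upper limit.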


The higher values shown in Table 4 were above the upper limit
defined from the quartiles, 708, so they are considered
atypical measurements and outliers.

The lower values shown in Table 4 were below the weak
concentration indicated in Table 1: 250 mg/L.

Because atypical values and outliers were observed, the
arithmetic mean is not representative. Therefore, the 5
percent trimmed mean was considered: 403 mg/L. According to
Table 1, the resulting classification is medium.

C. Relationship between BOD5 / COD

According to Metcalf & Eddy (1991) and CNA and IMTA (2000),
the BOD5/COD ratio varies between 0.4 and 0.8. Values in this
range indicate wastewater of the domestic type, while for
industrial waters the ratio is greater.

Applying this relationship to the data, the following results
were obtained: average 0.48, median 0.45 and mode 0.50.

The condition for a positively asymmetric distribution, as
previously indicated, is mean > median. The minimum value was
0.17, the maximum 1.27 and the range 1.10. Figure 8 shows the
results.

Figure 8. Histogram of the BOD5/COD ratio in the treatment
plant influent.

As observed on the class marks (X axis), most measurements
indicate that the wastewater is of the domestic type, with
BOD5/COD ratios starting from 0.17, although values exceeding
0.8 appear on the right side of the histogram. These results
match the atypical observations shown in Figures 3 and 6. This
suggests that the atypical measurements are isolated
discharges of wastewater from industrial processes into the
drainage system, and therefore it is prudent to establish
discharge control for commercial, industrial and service
wastewater, in order to protect the municipal pipeline system
and the operation of the treatment plant (EPA, 1987; CNA and
IMTA, 2000; UNAM, 2000).

To identify the number of measurements corresponding to
industrial wastewater, Figure 9 shows the cumulative frequency
percentage.

Figure 9. Cumulative frequency graph, in percentage, of the
BOD5/COD ratio in the treatment plant influent.

According to Figure 9, a little less than 34.88 percent of the
measurements (182) are below 0.4, while 96.37 percent (about
504) are below the upper limit for domestic wastewater. The
remaining 3.63 percent of the measurements were above 0.8. As
mentioned before, this last percentage suggests that
wastewater from industrial processes is discharged into the
drainage system. This coincides with the analysis of the BOD5
and COD parameters.
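The ratio criterion used in this section can be sketched as a simple check. The 0.4 to 0.8 range is the one cited in the text; the function name `ratio_category` and the wording of the labels are hypothetical, and the example assumes that a case number refers to the same sample in Tables 3 and 4.

```python
DOMESTIC_MIN, DOMESTIC_MAX = 0.4, 0.8  # domestic range cited in the text


def ratio_category(bod5, cod):
    """Return the BOD5/COD ratio and its position relative to the range."""
    ratio = bod5 / cod
    if ratio > DOMESTIC_MAX:
        label = "above domestic range (possible industrial discharge)"
    elif ratio >= DOMESTIC_MIN:
        label = "domestic range"
    else:
        label = "below domestic range"
    return ratio, label


# Case 367: BOD5 = 427 mg/L (Table 3) and COD = 815 mg/L (Table 4).
ratio, label = ratio_category(427, 815)
print(round(ratio, 2), label)
```

Applied to every measurement, the share of ratios above 0.8 would correspond to the 3.63 percent discussed above.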

D. Identification of possible infringing users

A criterion for identifying infringing users was established
by the former Secretariat of Commerce and Industrial
Development. This agency categorized users according to the
types of pollutants they generate, in a document called
Mexican Classification of Activities and Products, which was
published by the National Water Commission (CNA and IMTA,
2000). Table 6 shows an excerpt of that classification.

Table 6. Industries that generate high levels of contaminants. Adapted from UAM (1997) and EPA (1987),
cited in CNA e IMTA (2000).

Process  Description                     EC   HP   F and O   SS   TSS   BOD   COD
3111     Meat industry                             X              X     X     X
3112     Manufacture of dairy products             X              X     X     X
3113     Canned food processing          0         X         X          X     X
3114     Grain milling                             X         X    X
3115     Bread making                              X         X                X
3116     Making tortillas                0    0              X    X     X     X
3117     Edible oils and fats                      0                          X
3118     Sugar industry                  0    0                   X     X     X
3119     Chocolate manufacturing                   X         X    X     X     X
3120     Food products                             X         X    X     X     X
3121     Animal feed                               X              X     X     X
3122     Beverage industry               0         X              X     X     X

Table 6 lists industries that generate pollutants, including
those for which BOD5 and COD are reported. This information
can be used as an initial assessment of the potential users
that discharge pollutants above the maximum limits permitted
by the standard. Consequently, the water operator corporation
of the city can start by inspecting these industries.

A treatment system can then be proposed at the point where the
pollutants are generated, so that the discharge meets the
concentration indicated by the standard for wastewater
discharges to municipal sewer systems, NOM-002-ECOL-1996,
published in the Official Journal of the Federation (DOF).

This standard indicates 200 mg/L as the daily average BOD5
for discharges to the sewer system (DOF, 1998). Table 7
summarizes the results of the analysis of the raw wastewater
in the treatment plant influent.


Table 7. Summary of wastewater classification results.

Concept                      BOD5                        COD
Classification of            Medium, with industrial     Medium, with industrial
typical wastewater           discharges                  discharges

IV. Conclusion

Descriptive statistics were applied to the data; the
pollutants analyzed were BOD5 and COD in the influent of the
treatment plant. The wastewater strength was then determined
according to the classification shown in Table 1.

Possible discharges of industrial wastewater were identified,
as well as processes with high organic matter content.

The application of basic statistics to the characterization
of wastewater yields important results, so that those
responsible for the operation of the treatment system can
recommend to the water utility the types of industries and
processes to be monitored, in order to establish control of
pollutants prior to discharge into the municipal sewer.
Building a broad database of influent characterizations for
the wastewater treatment system, together with the analytical
results, provides a basis for future comparisons that identify
significant deviations in the concentration of pollutants.
With this, it is possible to detect potential offenders.